5 research outputs found

    Toward Enhanced Metadata Quality of Large-Scale Digital Libraries: Estimating Volume Time Range

    In large-scale digital libraries, it is not uncommon for some bibliographic fields in metadata records to be incomplete or missing. Filling in incomplete or missing metadata can greatly facilitate users' search of and access to digital library resources. Temporal information, such as publication date, is a key descriptor of digital resources. In this study, we investigate text mining methods to automatically resolve missing publication dates for the HathiTrust corpora, a large collection of documents digitized by optical character recognition (OCR). In comparison with previous approaches using only unigrams as features, our experimental results show that methods incorporating higher-order n-gram features, e.g., bigrams and trigrams, can more effectively classify a document into discrete temporal intervals or "chronons". Our approach can be generalized to classify volumes within other digital libraries.
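As a rough illustration of the feature idea in this abstract, the sketch below counts unigrams, bigrams, and trigrams in a text and maps a publication year to a fixed-width interval. It is a minimal sketch, not the authors' method: the function names `ngram_counts` and `chronon`, the whitespace tokenization, and the 50-year interval width starting at 1800 are all assumptions made for the example.

```python
from collections import Counter


def ngram_counts(text, max_n=3):
    """Count all n-grams of length 1..max_n in a whitespace-tokenized text."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def chronon(year, start=1800, width=50):
    """Map a year to the (first, last) years of its interval.

    The start year and interval width are hypothetical values,
    not taken from the study.
    """
    first = start + ((year - start) // width) * width
    return (first, first + width - 1)


features = ngram_counts("the steam engine the steam")
label = chronon(1867)
```

Counts like these could serve as input features for any standard text classifier, with the interval returned by `chronon` as the class label.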

    Diseases across the Top Five Languages of the PubMed Database: 1961-2012

    No full text
    This visualization focuses on diseases in the biomedical literature in English, French, German, Japanese and Russian between 1961 and 2012. We mapped and visualized the titles of the articles from MEDLINE/PubMed, a database maintained by the U.S. National Library of Medicine (NLM) at the National Institutes of Health (NIH), to the categories of diseases from the International Classification of Diseases.

    Fileset: Diseases across the Top Five Languages in PubMed

    No full text
    This fileset contains two files, a visualization and its description, which were submitted to the WebSci'14 conference data visualization challenge (http://www.websci14.org/#call-for-data-visualization-challenge). The submission won the best student award. Suggested citation: Zoss, Angela; Edelblute, Trevor; Kouper, Inna. (2014): Diseases across the Top Five Languages of the PubMed Database: 1961-2012. doi: 10.6084/m9.figshare.1033878

    Diseases across the Top Five Languages of the PubMed Database: 1961-2012 // Description

    No full text
    A short description of how the visualization of diseases across the top five languages in PubMed was produced.

    Data quality, transparency and reproducibility in large bibliographic datasets

    No full text
    Increasingly, large bibliographic databases are hosted by dedicated teams that commit to database quality, curation, and sharing, thereby providing excellent sources of data. Some databases, such as PubMed or HathiTrust Digital Library, offer APIs and describe the steps to retrieve or process their data. Others of comparable size and importance to bibliographic scholarship, such as the ACM Digital Library, still forbid data mining. The additional cleaning and expansion steps required to overcome barriers to data acquisition must be reproducible and incorporated into the curation pipeline, or the use of large bibliographic databases for analysis will remain costly, time-consuming, and inconsistent. In this presentation, we will describe our efforts to create reproducible workflows to generate datasets from three large bibliographic databases: PubMed, DBLP (as a proxy for the ACM Digital Library), and HathiTrust. We will compare these sources of bibliographic data and address the following: initial download and setup, gap analysis, and supplemental sources for data retrieval and integration. By sharing our workflows and discussing both automated and manual steps of data enhancement, we hope to encourage researchers and data providers to think about sharing the responsibility for openness, transparency and reproducibility in re-using large bibliographic databases.